Extraction of Translation Equivalents from Non-Parallel Corpora
Abstract
This paper presents a widely applicable method for extracting bilingual expressions from non-parallel corpora. The algorithm first collects word sequences as candidates for translation equivalents that match given patterns of word sequences from each corpus. Then, translation equivalents are selected from these candidates by aligning component words from within word sequences. We show the results of acquiring Japanese and English compound nouns from unrelated financial newspapers. We also demonstrate that the method can collect pairs that do not appear in terminological dictionaries.
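The two steps described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the patterns, the data format, or the alignment procedure, so everything below is an assumption made for illustration: the toy part-of-speech-tagged corpora, the noun-noun pattern, the small seed dictionary `SEED`, and the position-by-position component matching in the hypothetical helpers `collect_candidates` and `align_candidates`.

```python
# Toy, hypothetical data: each corpus is a list of (word, part-of-speech) tokens.
# The two "corpora" are unrelated texts, not translations of each other.
EN_CORPUS = [("the", "DET"), ("trade", "NOUN"), ("friction", "NOUN"),
             ("hurt", "VERB"), ("the", "DET"), ("stock", "NOUN"), ("market", "NOUN")]
JA_CORPUS = [("貿易", "NOUN"), ("摩擦", "NOUN"), ("が", "PART"),
             ("株式", "NOUN"), ("市場", "NOUN"), ("に", "PART"), ("響く", "VERB")]

# Assumed seed dictionary mapping component words to possible translations.
SEED = {"trade": {"貿易"}, "friction": {"摩擦"},
        "stock": {"株式"}, "market": {"市場"}}


def collect_candidates(tagged, pattern=("NOUN", "NOUN")):
    """Step 1: collect word sequences whose tag sequence matches the pattern."""
    n = len(pattern)
    found = set()
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if tuple(tag for _, tag in window) == pattern:
            found.add(tuple(word for word, _ in window))
    return found


def align_candidates(src_cands, tgt_cands, seed):
    """Step 2: keep candidate pairs whose component words align via the seed.

    Assumption: components are matched position by position, which is a
    simplification chosen for this sketch.
    """
    pairs = []
    for src in sorted(src_cands):
        for tgt in sorted(tgt_cands):
            same_length = len(src) == len(tgt)
            if same_length and all(t in seed.get(s, ()) for s, t in zip(src, tgt)):
                pairs.append((src, tgt))
    return pairs


en_cands = collect_candidates(EN_CORPUS)
ja_cands = collect_candidates(JA_CORPUS)
pairs = align_candidates(en_cands, ja_cands, SEED)
for src, tgt in pairs:
    print(" ".join(src), "<->", " ".join(tgt))
```

Because candidates are collected from each corpus independently, the sketch never needs sentence-aligned text; only the component-level dictionary links the two sides.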
Similar resources
Extracting a Parallel Corpus from Comparable Documents to Improve Translation Quality in Machine Translation Systems
Data used for training statistical machine translation methods are usually prepared from three kinds of resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...
Measuring Comparability of Documents in Non-Parallel Corpora for Efficient Extraction of (Semi-)Parallel Translation Equivalents
In this paper we present and evaluate three approaches to measure comparability of documents in non-parallel corpora. We develop a task-oriented definition of comparability, based on the performance of automatic extraction of translation equivalents from the documents aligned by the proposed metrics, which formalises intuitive definitions of comparability for machine translation research. We de...
Learning Spanish-Galician Translation Equivalents Using a Comparable Corpus and a Bilingual Dictionary
So far, research on extraction of translation equivalents from comparable, non-parallel corpora has not been very popular. The main reason was the poor results when compared to those obtained from aligned parallel corpora. The method proposed in this paper, relying on seed patterns generated from external bilingual dictionaries, allows us to achieve similar results to those from parallel corpus...
Extraction of Translation Equivalents from Parallel Corpora Using Sense-sensitive Contexts
The paper proposes an unsupervised method to extract translation equivalents from parallel corpora. The strategy we use takes into account the context of words. Given a word of the source language and a particular context, we learn its word translation within an equivalent context. We first extract pairs of similar contexts and, then, we compare the similarity between words appearing in each pa...
Extracting Multilingual Lexicons from Parallel Corpora
The paper describes our recent developments in automatic extraction of translation equivalents from parallel corpora. We describe three increasingly complex algorithms: a simple baseline iterative method, and two non-iterative more elaborated versions. While the baseline algorithm is mainly described for illustrative purposes, the non-iterative algorithms outline the use of different working hy...